Principal Component Analysis (PCA)

  • Up until now we've only talked about supervised methods.
    • What were these again?
  • Now we want to discuss unsupervised methods that highlight aspects of data without known labels.
  • Fundamentally PCA is a dimensionality reduction method.
  • As such it can be used, for example, for feature extraction, visualization, and noise filtering.

In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

Intuition of PCA

The way PCA works is easiest explained by visualizing its behaviour. So let's plot a two-dimensional dataset:


In [ ]:
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');
  • By eye, what can we say about this dataset?
  • Is there a relationship between x and y? And if so, what kind of relationship is it?
  • We had a similar dataset in Introducing Scikit-Learn. What did we do then?
  • PCA tries to find a list of the principal axes in the data.
  • Let's employ Scikit-Learn to do that for us:

In [ ]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)

PCA learns what the components are and how much variance each of them explains.


In [ ]:
print(pca.components_)

In [ ]:
print(pca.explained_variance_)
  • What could that possibly mean? Any ideas?
  • When it's plotted, it becomes clearer:

In [ ]:
def draw_vector(v0, v1, ax=None):
    """Draw an arrow from point v0 to point v1 on the given (or current) axes."""
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->',
                      linewidth=2,
                      shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');
  • What are you seeing? There are three important aspects:
    • The direction,
    • the origin,
    • and the length of the vectors.
  • Each vector is a principal component.
  • The length indicates how "important" this component is. Mathematically: it's the variance of the data projected onto that principal axis (see the quick check after this list).
  • The direction indicates the orientation of the principal axis.
  • The origin of the vectors is the mean of the data.
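
To verify the variance interpretation, here is a quick sketch (not part of the original notebook): project the centred data onto each principal axis and compare the per-axis variance with pca.explained_variance_.


In [ ]:
# Quick check (assumes `X` and the 2-component `pca` fitted above): the
# variance of the data projected onto each principal axis should match
# pca.explained_variance_ (scikit-learn uses the sample variance, ddof=1).
X_centered = X - pca.mean_
projected = X_centered @ pca.components_.T
print(np.var(projected, axis=0, ddof=1))
print(pca.explained_variance_)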

What do the principal components of the following dataset look like?


In [ ]:
X = rng.randn(250, 2)
plt.scatter(X[:, 0], X[:, 1])

In [ ]:
# fit estimator
pca = PCA(n_components=2)
pca.fit(X)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=1)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)

Dimensionality reduction

So, how can we use that in order to reduce the dimensionality of our dataset?

Note: We want to remove dimensions in a way that preserves the distances between data points as well as possible (we will verify this with a quick check below).

Let's have a look:


In [ ]:
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)

In [ ]:
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.5)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.5)
plt.axis('equal');
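
As a quick check of the note above (a sketch, not part of the original notebook), we can compare the pairwise distances before and after the projection; a correlation close to 1 means the distances are largely preserved.


In [ ]:
# Correlate pairwise distances in the original data with those after the
# projection onto one component (assumes `X` and `X_new` from above).
from sklearn.metrics import pairwise_distances
d_orig = pairwise_distances(X)
d_proj = pairwise_distances(X_new)
print(np.corrcoef(d_orig.ravel(), d_proj.ravel())[0, 1])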

What has happened?

  • The information along the least important principal axis or axes is removed.
  • The component(s) of the data with the highest variance remain.
  • The fraction of variance that is cut out is roughly a measure of how much "information" is discarded in this reduction of dimensionality (see the quick check below).
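
We can quantify this with scikit-learn's explained_variance_ratio_ (a quick sketch, not part of the original notebook):


In [ ]:
# Fraction of variance kept by the 1-component projection vs. discarded
# (assumes the 1-component `pca` fitted above).
print(pca.explained_variance_ratio_)
print(1 - pca.explained_variance_ratio_.sum())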

What does that mean?

  • This reduced-dimension dataset is "good enough" to encode the most important relationships between the points
  • Despite reducing the dimension of the data by 50%, the overall relationships between the data points are mostly preserved.

PCA as dimensionality reduction: Iris dataset

Recall: The iris dataset is four-dimensional


In [ ]:
import seaborn as sns
iris = sns.load_dataset('iris')
X_iris = iris.drop('species', axis=1)
y_iris = iris['species']
iris.head()

In [ ]:
from sklearn.decomposition import PCA
model = PCA(n_components=2)
model.fit(X_iris)
X_2D = model.transform(X_iris)

In [ ]:
# map each species to a plotting color
colormap = y_iris.copy()
colormap[colormap == 'setosa'] = 'b'
colormap[colormap == 'virginica'] = 'r'
colormap[colormap == 'versicolor'] = 'g'

plt.scatter(X_2D[:, 0], X_2D[:, 1], c=colormap)
plt.xlabel('PCA1')
plt.ylabel('PCA2');

What do we see from this plot?

  • In the two-dimensional representation, the species are fairly well separated.
  • Remember, the PCA algorithm had no knowledge of the species labels!
  • Classification will therefore probably be effective on this dataset (a quick check follows below).
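
As a minimal sketch (not part of the original notebook; GaussianNB is only one possible choice of classifier), we can test this on the two-dimensional representation:


In [ ]:
# Cross-validated accuracy of a simple classifier on the 2D PCA projection
# (assumes `X_2D` and `y_iris` from above; GaussianNB is only an example).
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
print(cross_val_score(GaussianNB(), X_2D, y_iris, cv=5).mean())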

PCA as Noise Filtering: Digits dataset

  • PCA can be used to filter noise
  • The idea is this: any components with variance much larger than the effect of the noise should be relatively unaffected by the noise.
  • If you reconstruct the data using just the components explaining the most variance, you should be preferentially keeping the signal and throwing out the noise.

Let's see how this looks with the digits data. First we will plot several of the noise-free input digits:


In [ ]:
from sklearn.datasets import load_digits
digits = load_digits()

def plot_digits(data):
    """Plot the first 40 digit images in a 4x10 grid."""
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks':[], 'yticks':[]},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8, 8),
                  cmap='binary', interpolation='nearest',
                  clim=(0, 16))

plot_digits(digits.data)

Now, let's add some noise:


In [ ]:
np.random.seed(42)
# add Gaussian noise with standard deviation 4 to each pixel
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)
  • Let's train a PCA on the noisy data, requesting that the projection preserve 50% of the variance:

In [ ]:
pca = PCA(0.50).fit(noisy)
pca.n_components_

In [ ]:
components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
plot_digits(filtered)
  • This signal-preserving and noise-filtering property makes PCA a very useful feature selection routine.
  • Rather than training a classifier on very high-dimensional data, you might instead train the classifier on the lower-dimensional representation, which will automatically serve to filter out random noise in the inputs (see the sketch below).
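
A minimal sketch of that idea (not part of the original notebook; GaussianNB is only an example classifier): compare a classifier trained on the full 64-dimensional noisy digits with one trained on the PCA-reduced representation.


In [ ]:
# Cross-validated accuracy on the noisy 64-dimensional pixels vs. on the
# PCA-reduced features (assumes `noisy`, `digits` and the 50%-variance `pca`
# from above; GaussianNB is only an example choice).
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
print(cross_val_score(GaussianNB(), noisy, digits.target, cv=5).mean())
print(cross_val_score(GaussianNB(), pca.transform(noisy), digits.target, cv=5).mean())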